62 research outputs found

    Inferring Network Mechanisms: The Drosophila melanogaster Protein Interaction Network

    Naturally occurring networks exhibit quantitative features revealing underlying growth mechanisms. Numerous network mechanisms have recently been proposed to reproduce specific properties such as degree distributions or clustering coefficients. We present a method for inferring the mechanism most accurately capturing a given network topology, exploiting discriminative tools from machine learning. The Drosophila melanogaster protein network is confidently and robustly (to noise and training data subsampling) classified as a duplication-mutation-complementation network over preferential attachment, small-world, and other duplication-mutation mechanisms. Systematic classification, rather than statistical study of specific properties, provides a discriminative approach to understand the design of complex networks. Comment: 19 pages, 5 figure

    Predicting Genetic Regulatory Response Using Classification

    We present a novel classification-based method for learning to predict gene regulatory response. Our approach is motivated by the hypothesis that in simple organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular experiment based on (1) the presence of binding site subsequences ("motifs") in the gene's regulatory region and (2) the expression levels of regulators such as transcription factors in the experiment ("parents"). Thus our learning task integrates two qualitatively different data sources: genome-wide cDNA microarray data across multiple perturbation and mutant experiments along with motif profile data from regulatory sequences. We convert the regression task of predicting real-valued gene expression measurement to a classification task of predicting +1 and -1 labels, corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. The learning algorithm employed is boosting with a margin-based generalization of decision trees, alternating decision trees. This large-margin classifier is sufficiently flexible to allow complex logical functions, yet sufficiently simple to give insight into the combinatorial mechanisms of gene regulation. We observe encouraging prediction accuracy on experiments based on the Gasch S. cerevisiae dataset, and we show that we can accurately predict up- and down-regulation on held-out experiments. Our method thus provides predictive hypotheses, suggests biological experiments, and provides interpretable insight into the structure of genetic regulatory networks. Comment: 8 pages, 4 figures, presented at Twelfth International Conference on Intelligent Systems for Molecular Biology (ISMB 2004), supplemental website: http://www.cs.columbia.edu/compbio/geneclas
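
The regression-to-classification step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of log expression ratios, and the single symmetric `noise_threshold` are assumptions made here for the example.

```python
from typing import List, Tuple


def discretize_expression(log_ratios: List[float], noise_threshold: float) -> List[int]:
    """Map real-valued log expression ratios to +1 / -1 / 0 labels.

    Values inside [-noise_threshold, +noise_threshold] are treated as
    indistinguishable from noise and get the label 0 (hypothetical convention).
    """
    labels = []
    for value in log_ratios:
        if value > noise_threshold:
            labels.append(1)    # up-regulated beyond the noise level
        elif value < -noise_threshold:
            labels.append(-1)   # down-regulated beyond the noise level
        else:
            labels.append(0)    # within noise: no confident label
    return labels


def training_examples(log_ratios: List[float], noise_threshold: float) -> List[Tuple[int, int]]:
    """Return (index, label) pairs for measurements with a confident +1 or -1 label."""
    labels = discretize_expression(log_ratios, noise_threshold)
    return [(i, y) for i, y in enumerate(labels) if y != 0]
```

Only the confidently labeled measurements would then be passed to a classifier; how the noise level is estimated is not specified in the abstract and is left open here.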

    Information-theoretic approach to network modularity

    Exploiting recent developments in information theory, we propose, illustrate, and validate a principled information-theoretic algorithm for module discovery and the resulting measure of network modularity. This measure is an order parameter (a dimensionless number between 0 and 1). Comparison is made with other approaches to module discovery and to quantifying network modularity (using Monte Carlo generated Erdös-like modular networks). Finally, the network information bottleneck (NIB) algorithm is applied to a number of real world networks, including the “social” network of coauthors at the 2004 APS March Meeting

    Systematic identification of statistically significant network measures

    We present a graph embedding space (i.e., a set of measures on graphs) for performing statistical analyses of networks. Key improvements over existing approaches include discovery of “motif hubs” (multiple overlapping significant subgraphs), computational efficiency relative to subgraph census, and flexibility (the method is easily generalizable to weighted and signed graphs). The embedding space is based on scalars, functionals of the adjacency matrix representing the network. Scalars are global, involving all nodes; although they can be related to subgraph enumeration, there is not a one-to-one mapping between scalars and subgraphs. Improvements in network randomization and significance testing—we learn the distribution rather than assuming Gaussianity—are also presented. The resulting algorithm establishes a systematic approach to the identification of the most significant scalars and suggests machine-learning techniques for network classification
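
To make the idea of a scalar as a functional of the adjacency matrix concrete, here is a small sketch. It uses the trace of a matrix power, which counts closed walks, as one illustrative functional; this choice is an assumption for the example and is not claimed to be one of the paper's scalars. The helper names are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two square integer matrices of the same size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def closed_walks(adj: Matrix, length: int) -> int:
    """Return trace(adj ** length), the number of closed walks of that length."""
    power = adj
    for _ in range(length - 1):
        power = matmul(power, adj)
    return sum(power[i][i] for i in range(len(power)))


# Two small undirected graphs on three nodes.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

print(closed_walks(triangle, 3))  # 6 closed 3-walks: one triangle, counted 6 times
print(closed_walks(path, 3))      # 0: a path contains no triangle
```

The value depends on every node at once. For walk lengths above 3 it mixes contributions from several different subgraphs, which is one way a scalar can fail to map one-to-one onto a single subgraph.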

    Discriminative Topological Features Reveal Biological Network Mechanisms

    Recent genomic and bioinformatic advances have motivated the development of numerous random network models purporting to describe graphs of biological, technological, and sociological origin. The success of a model has been evaluated by how well it reproduces a few key features of the real-world data, such as degree distributions, mean geodesic lengths, and clustering coefficients. Often pairs of models can reproduce these features with indistinguishable fidelity despite being generated by vastly different mechanisms. In such cases, these few target features are insufficient to distinguish which of the different models best describes real world networks of interest; moreover, it is not clear a priori that any of the presently-existing algorithms for network generation offers a predictive description of the networks inspiring them. To derive discriminative classifiers, we construct a mapping from the set of all graphs to a high-dimensional (in principle infinite-dimensional) "word space." This map defines an input space for classification schemes which allow us for the first time to state unambiguously which models are most descriptive of the networks they purport to describe. Our training sets include networks generated from 17 models either drawn from the literature or introduced in this work, source code for which is freely available. We anticipate that this new approach to network analysis will be of broad impact to a number of communities. Comment: supplemental website: http://www.columbia.edu/itc/applied/wiggins/netclass

    A classification-based framework for predicting and analyzing gene regulatory response

    BACKGROUND: We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem: predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the Adaboost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree. METHODS: In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a shallower and more interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data. RESULTS: Using the improved scalability of Robust GeneClass, we present larger scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast (the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors) and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code are available for download from

    Measurement of the cosmic ray spectrum above $4\times10^{18}$ eV using inclined events detected with the Pierre Auger Observatory

    A measurement of the cosmic-ray spectrum for energies exceeding $4\times10^{18}$ eV is presented, which is based on the analysis of showers with zenith angles greater than $60^{\circ}$ detected with the Pierre Auger Observatory between 1 January 2004 and 31 December 2013. The measured spectrum confirms a flux suppression at the highest energies. Above $5.3\times10^{18}$ eV, the "ankle", the flux can be described by a power law $E^{-\gamma}$ with index $\gamma=2.70 \pm 0.02\,\text{(stat)} \pm 0.1\,\text{(sys)}$ followed by a smooth suppression region. For the energy ($E_\text{s}$) at which the spectral flux has fallen to one-half of its extrapolated value in the absence of suppression, we find $E_\text{s}=(5.12\pm0.25\,\text{(stat)}^{+1.0}_{-1.2}\,\text{(sys)})\times10^{19}$ eV. Comment: Replaced with published version. Added journal reference and DO
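
The definition of the suppression energy can be illustrated numerically. The sketch below multiplies the power law by a smooth suppression factor that equals one-half at the suppression energy. The functional form of that factor, the steepness parameter `delta_gamma`, and the normalization `j0` are assumptions made here for illustration; the abstract states only the power-law index, the ankle energy, and the suppression energy.

```python
ANKLE_EV = 5.3e18        # ankle energy quoted in the abstract
GAMMA = 2.70             # power-law index above the ankle (central value)
E_SUPPRESSION_EV = 5.12e19  # suppression energy (central value)


def power_law_flux(energy_ev: float, j0: float = 1.0) -> float:
    """Unsuppressed power law, normalized to j0 at the ankle (arbitrary units)."""
    return j0 * (energy_ev / ANKLE_EV) ** (-GAMMA)


def suppressed_flux(energy_ev: float, delta_gamma: float = 5.0, j0: float = 1.0) -> float:
    """Power law times an assumed smooth suppression factor.

    The factor 1 / (1 + (E / E_s) ** delta_gamma) is a hypothetical choice
    that equals exactly one-half at E = E_s, matching the definition of E_s.
    """
    suppression = 1.0 / (1.0 + (energy_ev / E_SUPPRESSION_EV) ** delta_gamma)
    return power_law_flux(energy_ev, j0) * suppression


ratio_at_es = suppressed_flux(E_SUPPRESSION_EV) / power_law_flux(E_SUPPRESSION_EV)
print(ratio_at_es)  # one-half of the extrapolated power law, by construction
```

Well below the suppression energy the factor is close to 1, so the flux follows the plain power law there.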

    Energy Estimation of Cosmic Rays with the Engineering Radio Array of the Pierre Auger Observatory

    The Auger Engineering Radio Array (AERA) is part of the Pierre Auger Observatory and is used to detect the radio emission of cosmic-ray air showers. These observations are compared to the data of the surface detector stations of the Observatory, which provide well-calibrated information on the cosmic-ray energies and arrival directions. The response of the radio stations in the 30 to 80 MHz regime has been thoroughly calibrated to enable the reconstruction of the incoming electric field. For the latter, the energy deposit per area is determined from the radio pulses at each observer position and is interpolated using a two-dimensional function that takes into account signal asymmetries due to interference between the geomagnetic and charge-excess emission components. The spatial integral over the signal distribution gives a direct measurement of the energy transferred from the primary cosmic ray into radio emission in the AERA frequency range. We measure 15.8 MeV of radiation energy for a 1 EeV air shower arriving perpendicularly to the geomagnetic field. This radiation energy, corrected for geometrical effects, is used as a cosmic-ray energy estimator. Performing an absolute energy calibration against the surface-detector information, we observe that this radio-energy estimator scales quadratically with the cosmic-ray energy as expected for coherent emission. We find an energy resolution of the radio reconstruction of 22% for the data set and 17% for a high-quality subset containing only events with at least five radio stations with signal. Comment: Replaced with published version. Added journal reference and DO
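
The quadratic scaling quoted in this abstract can be written as a small calculation. The sketch assumes a shower arriving perpendicularly to the geomagnetic field, so the geometrical corrections mentioned in the abstract are omitted, and it treats the scaling as exactly quadratic. The function names are hypothetical.

```python
import math

# Radiation energy quoted in the abstract for a 1 EeV shower arriving
# perpendicularly to the geomagnetic field.
RADIATION_ENERGY_AT_1_EEV_MEV = 15.8


def radiation_energy_mev(cosmic_ray_energy_eev: float) -> float:
    """Radiation energy in MeV, assuming exactly quadratic scaling."""
    return RADIATION_ENERGY_AT_1_EEV_MEV * cosmic_ray_energy_eev ** 2


def cosmic_ray_energy_eev(radiation_energy: float) -> float:
    """Invert the quadratic relation to estimate the cosmic-ray energy in EeV."""
    return math.sqrt(radiation_energy / RADIATION_ENERGY_AT_1_EEV_MEV)


# Doubling the cosmic-ray energy quadruples the radiation energy.
print(radiation_energy_mev(2.0) / radiation_energy_mev(1.0))
```

A real reconstruction would also apply the geometrical corrections and account for the 17% to 22% energy resolution reported above.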